Guessing the Correct Inflectional Paradigm of Unknown Croatian Words
Abstract
A real-life morphological analyzer must be able to properly handle out-of-vocabulary words. We address the task of guessing the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a model that decides whether a candidate lemma-paradigm pair is correct, based on a number of string- and corpus-based features. Our aim is to examine the machine learning aspect of the problem: we analyze the features and evaluate classification accuracy using different feature subsets. We show that a satisfactory level of accuracy (92%) can be achieved with an SVM using a combination of string- and corpus-based features. We discuss a number of possible directions for future research.
Slovenian title: Ugibanje pravilne pregibne paradigme za neznane hrvaške besede
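The framing in the abstract (candidate lemma-paradigm pairs scored on string- and corpus-based features) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the two toy paradigms, their suffix lists, the feature names, and the helper `score_candidates` are hypothetical and are not the paper's paradigm inventory or feature set. The sketch stops at feature extraction; in the setting the abstract describes, a trained classifier such as an SVM would consume these feature vectors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    name: str
    lemma_suffix: str      # suffix that turns a stem into the lemma
    form_suffixes: tuple   # suffixes of all inflected forms (toy subset)


# Hypothetical, heavily simplified paradigms (illustration only).
PARADIGMS = [
    Paradigm("N-fem-a", "a", ("a", "e", "i", "u", "om", "ama")),
    Paradigm("N-masc-0", "", ("", "a", "u", "om", "i", "e", "ima")),
]


def candidates(word):
    """Yield (stem, paradigm) pairs under which `word` could be an inflected form."""
    seen = set()
    for paradigm in PARADIGMS:
        for suffix in paradigm.form_suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                stem = word[: len(word) - len(suffix)]
                if (stem, paradigm.name) not in seen:
                    seen.add((stem, paradigm.name))
                    yield stem, paradigm


def features(stem, paradigm, corpus_vocab):
    """Compute toy string-based and corpus-based features for one candidate."""
    forms = {stem + suffix for suffix in paradigm.form_suffixes}
    attested = sum(form in corpus_vocab for form in forms)
    return {
        "stem_length": len(stem),                    # string-based
        "stem_ends_in_vowel": stem[-1] in "aeiou",   # string-based
        "attested_ratio": attested / len(forms),     # corpus-based
    }


def score_candidates(word, corpus_vocab):
    """Return (lemma, paradigm name, feature dict) for every candidate pair."""
    return [
        (stem + paradigm.lemma_suffix, paradigm.name,
         features(stem, paradigm, corpus_vocab))
        for stem, paradigm in candidates(word)
    ]


# Toy corpus vocabulary: attested forms of the noun "žena".
corpus_vocab = {"žena", "žene", "ženi", "ženu", "ženom", "ženama"}
rows = score_candidates("ženu", corpus_vocab)
for lemma, paradigm_name, feats in rows:
    print(lemma, paradigm_name, feats)
```

In this toy run the unknown form "ženu" yields three candidate pairs, and more than one of them has a sizeable share of its generated forms attested in the corpus. That overlap is one reason a single corpus count is a weak signal on its own and a learned combination of features is attractive.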
Related papers
Predicting Inflectional Paradigms and Lemmata of Unknown Words for Semi-automatic Expansion of Morphological Lexicons
In this paper we describe a semi-automated approach to extend morphological lexicons by defining the prediction of the correct inflectional paradigm and the lemma for an unknown word as a supervised ranking task trained on an already existing lexicon. While most ranking approaches rely only on heuristics based on a single information source, our predictor uses hundreds of features calculated on...
New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian
In this paper we present newly developed inflectional lexicons and manually annotated corpora of Croatian and Serbian. We introduce hrLex and srLex—two freely available inflectional lexicons of Croatian and Serbian—and describe the process of building these lexicons, supported by supervised machine learning techniques for lemma and paradigm prediction. Furthermore, we introduce hr500k, a manual...
The Impact of Correction for Guessing Formula on MC and Yes/No Vocabulary Tests' Scores
A standard correction for random guessing (cfg) formula on multiple-choice and Yes/No examinations was examined retrospectively in the scores of intermediate female EFL learners in an English language school. The correction was a weighting formula for points awarded for correct answers, incorrect answers, and unanswered questions so that the expected value of the increase in test score due to g...
Guessers for Finite-State Transducer Lexicons
Language software applications encounter new words, e.g., acronyms, technical terminology, names or compounds of such words. In order to add new words to a lexicon, we need to indicate their inflectional paradigm. We present a new generally applicable method for creating an entry generator, i.e. a paradigm guesser, for finite-state transducer lexicons. As a guesser tends to produce numerous sug...
Combining Part-of-Speech Tagger and Inflectional Lexicon for Croatian
This paper investigates several methods of combining output of a second order hidden Markov model part-of-speech/morphosyntactic tagger and a high-coverage inflectional lexicon for Croatian. Our primary motivation was to improve overall tagging accuracy of Croatian texts by using our newly-developed tagger. We also wanted to compare its tagging results – both standalone and utilizing the morpho...